FOSDEM’15

The OPW internship that I did between May and August 2014 offers sponsorship for attending a technical conference. I chose to go to FOSDEM, which took place in Brussels on 31 January and 1 February (Saturday and Sunday). It’s a free event for software developers and open source organizations to meet and share ideas. There are a ton of presentations on different topics you can attend: hardware, kernel, programming languages, distributed systems, web stuff, you name it. So, if you work in or are interested in the IT industry, you’ll definitely find some interesting talks there.

I arrived in Brussels two days before the conference because I wanted to have some time for sightseeing. I was lucky because the weather was perfect, so I explored the city all day long. Brussels is especially famous for chocolate and beer, so I had to try them both to convince myself :-).

On Saturday morning I arrived at the Université Libre de Bruxelles, where the conference was taking place. On the conference website you could download the schedule for a particular day. It listed all the main topics, and each presentation was assigned to one of them. All presentations belonging to the same topic were held in the same room. The schedule also had a map so you could find a particular room on the campus.

I’m currently interested in cloud computing and distributed systems, so I went to the room where the ‘Infrastructure as a service’ presentations were held. The first presentation I attended was ‘Build distributed Fault-Tolerant systems with Apache Mesos’. Dave Lester, the presenter, was so impressed by the number of attendees that he took a picture with us and posted it on Twitter :-). My OPW internship was actually at Apache Mesos, so I was very curious what the presentation would be about.

It began with an introduction to Apache Mesos: what its purpose is, how it achieves hardware abstraction for different kinds of frameworks, what entities it’s composed of (master, slaves, scheduler, framework), what their purpose is, and how the job scheduling process works. This was followed by a recorded demo, a sort of ‘Hello world’ applied to Apache Mesos, where a ‘print Hello FOSDEM’ command was run on a set of machines. Basically one task was running over and over, and each time Mesos picked a machine with free resources (from a pool of machines) to run the command.

In the end, Dave talked about Mesos at Twitter and how it improved scalability, especially during events like the World Cup, when traffic spiked sharply. He said that in 2010, before Twitter adopted Mesos and a more complex architecture, the website went down at each World Cup football match whenever there was a goal or a foul :-).

The next presentation I attended was called “GlusterFS – Overview & Future Directions”. I hadn’t heard of GlusterFS before, but since it was held in that particular room, I assumed it was a distributed file system. Niels de Vos, who is one of Gluster’s maintainers, wanted to cover a few things in the presentation: give a short overview of the network file system, explain its components (servers, clients, storage bricks, volumes) and how they interconnect to form a scalable file system, talk about the current features and the upcoming release, and do a small demo of how to set up Gluster. Unfortunately he managed to cover just the first part, because after 10 minutes people started asking question after question. Niels couldn’t continue his presentation because he answered all of them. I found this a bit strange, and it felt like he lost control of the talk. I can’t say whose fault it was: the audience, who should have let the presenter continue, or the presenter, who maybe should have held the questions until the presentation was finished.

After that I attended two presentations related to OpenStack: “Cinder – the state of block storage in OpenStack” and “OpenStack and Xen”. Since OpenStack is very popular nowadays in the cloud computing area, I wanted to find out more about it. OpenStack is a very large and complex project used to build an IaaS platform. It has a lot of components, each with a different purpose: Nova (compute service), Glance (image service), Swift (object store), Keystone (authentication), Horizon (the web interface), Cinder (block storage). There are more of them, but these are the basic infrastructure services.

In the first OpenStack presentation I learned more about the Cinder component: how it achieves block storage (device level) abstraction and gives the user an API for using the resources (basically, it provides persistent storage to guest virtual machines), what its main features are (snapshot management, volume management), and which features are upcoming.

The second OpenStack presentation was given by two guys: one who did the talking and one who actually made the presentation and the demo (this is what they said 🙂 ). First they presented the Xen hypervisor: they said it’s very secure, that it’s used in the aviation and car industries because of that, and that the entire Amazon cloud is based on it. After the introduction came the integration of OpenStack on top of Xen. The component of OpenStack that is aware of the hypervisor underneath is Nova (the computing engine). Nova “talks” to Xen through an API driver called Libvirt, which manages the VMs. After the overview there was a demo of how to deploy OpenStack on a Xen hypervisor: install your favorite Linux distro, install Xen (with apt-get install), reboot, clone the DevStack repo, which offers a script that will install OpenStack, edit the script to tell Libvirt to use Xen, and run the script.

On the second day I went to the ‘Security devroom’, where the security presentations were being held. The company I currently work for focuses on security products, and since the cloud stuff (my main interest) was covered on the first day, I thought I should give security a try as well.

The first presentation I attended was “Web Security” by Habib Virji, who is part of Samsung’s Open Source Group. The two topics were CSP (Content Security Policy) and Web Cryptography. I found out about the “Open Web Application Security Project (OWASP)”, an online community dedicated to web application security, which maintains a top ten list of web security threats. Two of them were addressed in the presentation: “Cross-site Scripting (XSS)” and “Sensitive data exposure”. XSS lets an attacker inject client-side scripts into a web page. It can be mitigated with CSP, which allows website owners to declare, in a standard HTTP header, the approved sources of content that browsers are allowed to load on that page. For the “Sensitive data exposure” issue the answer was the “Web Cryptography API”, a JavaScript API capable of hashing, encryption/decryption and signature generation/verification.
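To make the CSP idea a bit more concrete, here is a minimal Python sketch (not taken from the talk) that builds a `Content-Security-Policy` header value from a list of approved sources. The helper name `build_csp_header` and the CDN host are assumptions made up for illustration; the directive names `default-src` and `script-src` and the `'self'` keyword come from the CSP standard.

```python
def build_csp_header(policy):
    """Turn {directive: [sources]} into a Content-Security-Policy value.

    Hypothetical helper, written only for this illustration.
    """
    parts = []
    for directive, sources in policy.items():
        parts.append(directive + " " + " ".join(sources))
    return "; ".join(parts)


# Assumed example policy: load everything from our own origin, and
# scripts additionally from one trusted CDN (the host name is made up).
policy = {
    "default-src": ["'self'"],
    "script-src": ["'self'", "https://cdn.example.com"],
}
header_value = build_csp_header(policy)
print("Content-Security-Policy:", header_value)
```

A server would send this value in the response headers of each page. With such a policy the browser refuses scripts from any other origin, as well as inline scripts, which is what takes the sting out of an injected XSS payload.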

The second presentation was “The Fuzzing Project” by Hanno Böck, who, to my surprise, is a freelance journalist from Berlin who contributes to open source in his free time :-). Fuzzing is a testing technique that feeds random data to the input of a program. This can lead to crashes or even uncover security vulnerabilities that allow an attacker to take over the system. “The Fuzzing Project” has some tutorials and file samples to help you get started with fuzzing, and also a list of free software projects and how well they resist fuzzing attempts. I found out that many of the Linux tools I use on a daily basis have security vulnerabilities, mostly because of bad memory management in C. Being mostly a C coder, this made me more aware of security threats and of how I should test the software I am developing.
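The idea behind fuzzing fits in a few lines. The sketch below is an illustration only, not one of the project's tools: `parse_record` is a made-up toy parser with a deliberate bug, and `fuzz` is a hypothetical helper that throws random byte strings at it and records every input that makes it crash.

```python
import random


def parse_record(data):
    """Toy parser (hypothetical): first byte is a length, the rest is payload."""
    length = data[0]
    payload = data[1:1 + length]
    # Deliberate bug: trusts the length byte, so a short payload
    # (or an empty input) makes the indexing fail.
    return payload[length - 1]


def fuzz(target, runs=1000, max_len=8, seed=0):
    """Feed random byte strings to target and collect the inputs that crash it."""
    rng = random.Random(seed)  # seeded so that a crash can be reproduced
    crashes = []
    for _ in range(runs):
        size = rng.randint(0, max_len)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except Exception as exc:
            crashes.append((data, type(exc).__name__))
    return crashes


crashes = fuzz(parse_record)
print(len(crashes), "crashing inputs found")
```

In Python the bug only raises an exception. The same mistake in C would be an out-of-bounds read, which is the kind of memory bug mentioned above. Real fuzzers are also much smarter about choosing inputs than this purely random loop.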

The last presentation was “BIFUZ” and was given by two compatriots from Intel Romania :-). BIFUZ stands for “Broadcast Intent FUZzing Framework for Android” and is an open source testing tool which can find buffer overflows, Java exceptions and other security vulnerabilities by fuzzing intents in Android applications. An “Intent” is a messaging object you can use to request an action from another app component. Bugs found can be easily reproduced and sent to Google for verification. Intel has recently entered the mobile phone and tablet markets with its Intel Atom processors, and it seems BIFUZ was used to test their compatibility with Android OS.

To sum everything up, it was a nice experience for me: travel and open source worked magnificently together 🙂 . This conference motivated me to continue to work on open source projects and be part of open source communities.

OPW Pencils Down

The summer has almost ended and so has my OPW internship @ Apache Mesos.
I had a great time working on my project and it was a good learning experience for me.

In this blog post I will summarize my work on the project, generically called ‘Slave unregistration in Mesos’. I will also talk about the valuable things I have learned and my overall impression. This will be quite a long post, so I will break it into sections, in case you want to skip some parts:
1. What I have done (technical aspect)
2. What I have learned (non-technical)
3. Thanks (non-technical)

Technical aspect – What I have done during the summer

The initial purpose of the project was to drain Mesos slaves immediately, on demand. Draining a slave immediately implies killing all its underlying tasks/jobs. The draining should start when the slave receives a SIGUSR1 signal. Before shutting down, the slave should unregister with the master so that the master knows as soon as possible that the slave won’t be available any more.

‘drain now’ mechanism

My OPW project has several parts (I talked about them in my previous post here). I started my work with the first part, which is called the “drain now” mechanism. This mechanism aims to drain the slaves when a SIGUSR1 signal is sent to them.

‘Draining’ in this context means killing an entity (process) and everything that is running underneath it (all the processes that were started by that entity). So ‘draining the slave’ means killing the slave process and all the jobs/tasks the slave is running.

For the first part of the mechanism, I implemented a signal handler which is triggered when a SIGUSR1 signal is received. The handler kills everything that runs underneath the slave and after that shuts down the slave.
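The real slave is written in C++, so the following is only a rough Python sketch of the same idea, not the actual patch. The `Agent` class, its `launch` and `drain` methods and `install_drain_handler` are all hypothetical names invented for this illustration; only `signal.signal`, `signal.SIGUSR1` and `subprocess.Popen` are real standard library calls (and `SIGUSR1` exists only on Unix-like systems).

```python
import signal
import subprocess
import sys


class Agent:
    """Stand-in for a slave process that owns a set of child tasks."""

    def __init__(self):
        self.tasks = []
        self.running = True

    def launch(self, seconds):
        # Each "task" here is just a child Python process that sleeps.
        code = "import time; time.sleep(%d)" % seconds
        proc = subprocess.Popen([sys.executable, "-c", code])
        self.tasks.append(proc)
        return proc

    def drain(self, signum=None, frame=None):
        # Kill everything running underneath the agent ...
        for proc in self.tasks:
            if proc.poll() is None:
                proc.kill()
            proc.wait()
        # ... and only then stop the agent itself.
        self.running = False


def install_drain_handler(agent):
    # Assumption for this sketch: draining is triggered by SIGUSR1,
    # as described above. Must be called from the main thread.
    signal.signal(signal.SIGUSR1, agent.drain)
```

After `install_drain_handler(agent)`, running `kill -USR1 <pid>` against the process would call `agent.drain`, which kills the children first and then lets the main loop exit by checking `agent.running`.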

For the second part of the mechanism, I extended the first part so that the slave sends an unregister request message to the master before shutting down. This is because normally the master waits up to the health checking delay (~75 seconds) until it considers the tasks that were running on the slave as lost. With this change, the master marks the tasks as lost (and does everything necessary when a task becomes lost) as soon as it receives the unregister request message. In addition, the master removes the slave from its lists.
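The difference the unregister message makes can be shown with a toy model of the master's bookkeeping. This is a simplified sketch and not the real Mesos code: the `Master` class and its methods are hypothetical, time is passed in by hand as `now`, and the 75 second default only mirrors the approximate health checking delay quoted above.

```python
class Master:
    """Toy model of the master's view of its slaves (illustration only)."""

    def __init__(self, health_check_delay=75.0):
        self.health_check_delay = health_check_delay
        self.last_seen = {}  # slave id -> time of last sign of life
        self.tasks = {}      # slave id -> list of task ids
        self.lost = {}       # task id -> time it was declared lost

    def register(self, slave_id, task_ids, now):
        self.last_seen[slave_id] = now
        self.tasks[slave_id] = list(task_ids)

    def _remove(self, slave_id, now):
        # Mark every task of the slave as lost and forget the slave.
        for task_id in self.tasks.pop(slave_id):
            self.lost[task_id] = now
        del self.last_seen[slave_id]

    def unregister(self, slave_id, now):
        # Explicit message from the slave: react immediately.
        if slave_id in self.tasks:
            self._remove(slave_id, now)

    def check_health(self, now):
        # Fallback path: a silent slave is only noticed after the delay.
        for slave_id, seen in list(self.last_seen.items()):
            if now - seen >= self.health_check_delay:
                self._remove(slave_id, now)
```

If two slaves go away at the same moment and only one of them calls `unregister`, its tasks are reported lost right away, while the tasks of the silent one stay in limbo until `check_health` runs after the delay has passed.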

At first glance all this seems pretty easy to do, but things tend to get a bit complicated when you’re working on a big project that lots of companies (including Twitter!!) rely on for their infrastructure. Every piece of code that you add has to be as good as possible so that the application remains efficient and free of bugs. So my patch went through a couple of review iterations until it was ready to be committed. My mentor (Ben Mahler) helped me a lot by reviewing all my code and giving me tips.

So the lesson learned is: ‘keep it simple’. When you have something to do, even if it looks easy, think about it twice; maybe there’s an even easier way to do it.

Thanks for reading and have a great day 🙂 !

“Slave unregistration in Mesos” – brief description

The project I will be working on is called “Slave unregistration in Mesos”. Before going into more detail about it, I will briefly explain Mesos’ architecture, so that my project will make more sense.

It is said that a picture is worth a thousand words, so here’s a picture which describes Mesos’ architecture (source here).

As you can see in the picture, Mesos has a distributed architecture, with several entities:

Master
There is only one leading master per cluster; for high availability there can be more masters but only one is active

Slaves
These are hosts in the datacenter and one slave daemon is running on each host; these hosts are the ones on which all the tasks are run

Zookeeper
It is responsible for the leader election between the masters and is also used by Mesos slaves to discover the master

Frameworks
These are applications (Hadoop, Spark, Aurora, MPI) which run their tasks, often analytics jobs, on top of Mesos.
Multiple frameworks can run on the same Mesos cluster because Mesos provides good resource isolation through Linux containers.
A framework consists of two parts:

scheduler
– is responsible for scheduling jobs/tasks
– it receives offers of the resources available in the cluster (memory, CPU, disk) from the master and uses them to launch tasks

executor
– is responsible for executing the tasks the scheduler wants to launch on the slave
– it is launched by the slave when a task is launched by the scheduler on the slave
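The offer mechanism described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Mesos API: `Offer`, `Scheduler` and the `resource_offers` method are hypothetical names, and only CPU and memory are tracked.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Resources of one slave that the master offers to a framework."""
    slave_id: str
    cpus: float
    mem_mb: int


class Scheduler:
    """Framework side: decides which offers to use for its pending tasks."""

    def __init__(self, pending):
        self.pending = list(pending)  # tuples of (task name, cpus, mem_mb)

    def resource_offers(self, offers):
        launched = []
        for offer in offers:
            cpus, mem = offer.cpus, offer.mem_mb
            # Iterate over a copy because tasks are removed as they launch.
            for task in list(self.pending):
                name, need_cpus, need_mem = task
                if need_cpus <= cpus and need_mem <= mem:
                    cpus -= need_cpus
                    mem -= need_mem
                    self.pending.remove(task)
                    launched.append((name, offer.slave_id))
        return launched


offers = [Offer("slave-1", 1.0, 512), Offer("slave-2", 4.0, 4096)]
scheduler = Scheduler([("web", 1.0, 256), ("db", 2.0, 2048), ("cache", 1.0, 512)])
launched = scheduler.resource_offers(offers)
print(launched)
```

Here the small `web` task fits into the first offer, while `db` and `cache` have to wait for the larger offer from the second slave. In the real system the slave would then start an executor to actually run each launched task.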

Initial motivation of the project

Sometimes a Mesos cluster has hundreds or thousands of slave machines. From time to time, some of them need maintenance (e.g. a system upgrade). Up until now, when an operator wanted to do some maintenance work on a slave, they manually connected to that slave and killed the slave process. This approach, however, leaves all the tasks running on the machine. So a mechanism is needed that will drain the slave (kill all the tasks and the slave daemon).

This is where the first part of my project comes in handy. It consists of two parts:

1. the ability to kill all the tasks that were running on the slave and after that shut down the slave daemon, on demand.
2. the problem with (1) is that the master will wait up to the health checking delay (~75 seconds) before notifying the framework that the tasks were lost. This is why, before shutting down, the slave will send an unregistration message to the master.

Basically the two items above represent the ‘drain now’ mechanism, because the slave will be drained immediately, when the signal is triggered.

The problem with the ‘drain now’ mechanism is that it kills all the jobs that were running on the slave, which is sometimes not desired. Some of them are long-running jobs which take a lot of time and resources to be rescheduled and to get back to the point where they had been before the slave was shut down. Therefore the following will also be implemented:

– the ability to deactivate slaves. This means that the master will no longer send resource offers belonging to that slave. As a result, frameworks can no longer launch new tasks on the slave. The tasks that had been launched before the slave was deactivated will continue running.
– the ability to send ‘inverse offers’ to frameworks, which means that the master asks the frameworks to return the resources they are using within a particular amount of time; if the resources are not returned in time, they will be forcibly revoked.
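Since these two features were still only planned at this point, the following Python sketch is just one possible reading of the description above, not a design from the Mesos project. The class `MaintenanceMaster` and all of its methods are hypothetical, and time is passed in by hand as `now`.

```python
class MaintenanceMaster:
    """Toy master that tracks which slaves may still receive new tasks."""

    def __init__(self, free_cpus):
        self.free_cpus = dict(free_cpus)  # slave id -> free cpus
        self.deactivated = set()
        self.reclaims = {}                # slave id -> deadline for returning

    def deactivate(self, slave_id):
        # Running tasks are left alone; the slave just gets no new ones.
        self.deactivated.add(slave_id)

    def offers(self):
        # Deactivated slaves are left out of the resource offers.
        return {s: c for s, c in self.free_cpus.items()
                if s not in self.deactivated}

    def inverse_offer(self, slave_id, now, grace):
        # Ask the frameworks to give the slave's resources back in time.
        self.reclaims[slave_id] = now + grace

    def returned(self, slave_id):
        # The frameworks handed the resources back voluntarily.
        self.reclaims.pop(slave_id, None)

    def revoke_overdue(self, now):
        # Anything not returned by its deadline is taken back by force.
        overdue = sorted(s for s, d in self.reclaims.items() if now >= d)
        for slave_id in overdue:
            del self.reclaims[slave_id]
        return overdue
```

For example, after `deactivate("s1")` the slave no longer shows up in `offers()`, and after `inverse_offer("s1", now=0.0, grace=30.0)` a call to `revoke_overdue(now=30.0)` reports `s1` unless `returned("s1")` was called first.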

So this is the description of my project. Please stay tuned for updates.

Thanks for reading and have a great day 🙂 !

Init

Hello everybody! This is a technical blog and it will mainly be about my progress on a project I am currently working on.

At the end of April (2014) I found out that I was accepted into the ‘Outreach Program for Women’ (OPW for short). I was very happy (and still am) about this because now I have the opportunity to contribute to the open source community. I have made some contributions to several open source applications before, but they were small fixes and I always wanted to do something significant, with greater impact.

First, let me introduce what OPW is. In short, it’s kinda like GSOC but for women. In more detail, it’s an internship program organized by the GNOME Foundation, where participants can contribute to open source applications. You can participate only if you are a woman; if not, there’s GSOC, which offers a lot of opportunities as well. Also, OPW doesn’t require you to be a student, so feel free to apply no matter what age you are.

Basically the purpose of this program is to encourage women to contribute to open source. In the past, the number of women who were actively involved in FOSS communities was very low, so they definitely needed more support and encouragement from organizations.

If you are not a programmer, don’t worry: OPW offers non-programming projects as well, including user experience design, graphic design, documentation, web development, marketing, translation, etc.

For more information, please check out their website here.

The project I will be working on is for Apache Mesos, a cluster management framework. Mesos is used by many companies (including Twitter) that have large clusters and want better and easier management for them.

For more information, please check out their website here.

So, this is all for the introduction post. For more information about my progress on the project, please stay tuned.

Thanks for reading and have a great day 🙂 !

Alex's tech soup
